Journal of Cheminformatics

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Journal of Cheminformatics's content profile, based on 25 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
DrugPlayGround: Benchmarking Large Language Models and Embeddings for Drug Discovery

Liu, T.; Jiang, S.; Zhang, F.; Sun, K.; Head-Gordon, T.; Zhao, H.

2026-04-07 bioinformatics 10.64898/2026.04.04.716470 medRxiv
Top 0.1%
12.5%

Large language models (LLMs) are in the ascendancy for research in drug discovery, offering unprecedented opportunities to reshape drug research by accelerating hypothesis generation, optimizing candidate prioritization, and enabling more scalable and cost-effective drug discovery pipelines. However, there is currently a lack of objective assessments of LLM performance to ascertain their advantages and limitations over traditional drug discovery platforms. To tackle this emergent problem, we have developed DrugPlayGround, a framework to evaluate and benchmark LLM performance for generating meaningful text-based descriptions of physicochemical drug characteristics, drug synergism, drug-protein interactions, and the physiological response to perturbations introduced by drug molecules. Moreover, DrugPlayGround is designed to work with domain experts to provide detailed explanations for justifying the predictions of LLMs, thereby testing LLMs for chemical and biological reasoning capabilities to push their greater use at the frontier of drug discovery at all of its stages.

2
Pro-GAT: Reconnecting Fragmented PROTACs Using Graph Attention Transformer

Vemuri, S.; Bijigiri, L. P.; Gogte, S.; Kondaparthi, V.

2026-02-23 bioinformatics 10.64898/2026.02.22.707266 medRxiv
Top 0.1%
10.4%

PROTACs work by bringing together a protein-of-interest ligand and an E3 ligase recruiter to trigger targeted degradation. However, diffusion-based generative models frequently produce chemically invalid or disconnected linker structures that satisfy global geometric constraints but violate local bonding requirements. These models operate in continuous coordinate space and therefore lack explicit mechanisms for enforcing discrete chemical connectivity under fixed-anchor constraints. Invalid, disconnected outputs recur rather than being a rare exception, such that naive resampling is not an effective method to obtain valid chimeras. Pro-GAT is a graph attention-based framework for geometry-preserving molecular graph repair, capable of functioning on chemically disconnected diffusion-generated PROTAC candidates by predicting bounded coordinate corrections and constrained atom-type modifications using geometry-aware graph attention network (GAT) layers. The proposed model is trained on PROTAC datasets with added disconnections to overcome systematic connectivity failures in diffusion-based PROTAC generation with fixed anchors. When combined with DiffPROTACs and DiffLinker, Pro-GAT improves the percentage of chemically valid candidates in the aggregated output from 76.70% to 83.92% and from 63.16% to 68.73%, while maintaining 80.18% and 63.80% uniqueness levels of valid candidates, respectively, thus facilitating the generation of usable PROTAC candidates from invalid diffusion samples. Pro-GAT was used in a case study of the 7Z76 ternary complex to repair samples generated by DiffPROTACs and DiffLinker, which gave rise to connected chimeras whose docking scores were comparable to the original 7Z76 structure.

3
SELFormerMM: multimodal molecular representation learning via SELFIES, structure, text, and knowledge graph integration

Ulusoy, E.; Bostanci, S.; Deniz, B. E.; Dogan, T.

2026-03-19 bioinformatics 10.64898/2026.03.17.712340 medRxiv
Top 0.1%
10.4%

Motivation: Molecular representation learning is central to computational drug discovery. However, most existing models rely on single-modality inputs, such as molecular sequences or graphs, which capture only limited aspects of molecular behaviour. Yet unifying these modalities with complementary resources such as textual descriptions and biological interaction networks into a coherent multimodal framework remains non-trivial, hindering more informative and biologically grounded representations. Results: We introduce SELFormerMM, a multimodal molecular representation learning framework that integrates SELFIES notations with structural graphs, textual descriptions, and knowledge graph-derived biological interaction data. By aligning these heterogeneous views, SELFormerMM effectively captures complementary signals that unimodal approaches often overlook. Our performance evaluation has revealed that SELFormerMM outperforms structure-, sequence-, and knowledge-based models on multiple molecular property prediction tasks. Ablation analyses further indicate that effective cross-modal alignment and modality coverage improve the model's ability to exploit complementary information. Overall, integrating SELFIES with structural, textual, and biological context enables richer molecular representations and provides a promising framework for hypothesis-driven drug discovery. Availability: SELFormerMM is available as a programmatic tool, together with datasets, pretrained models, and precomputed embeddings at https://github.com/HUBioDataLab/SELFormerMM. Contact: tuncadogan@gmail.com

4
EnzySeek: Efficient Exploration of Enzyme Reaction Pathways Using AI Agents

Kang, X.; Yu, T.; Xu, K.; Liu, C.; Wu, R.

2026-03-02 biochemistry 10.64898/2026.03.02.708939 medRxiv
Top 0.1%
10.2%

With the rapid development of Large Language Models (LLMs) and Agent technologies, AI can assist in solving a variety of real-world problems across multiple domains, such as autonomous driving, drug discovery, and materials design. In this work, we present EnzySeek, an enzyme catalysis AI agent designed to assist researchers in enzyme catalysis simulations. First, we constructed a domain-specific knowledge base by curating thousands of papers related to enzyme catalysis. Second, we customized Model Context Protocol (MCP) interfaces for each step of the enzyme catalysis simulation workflow, enabling these functions to be invoked by LLMs. Finally, we configured an agent capable of simultaneously referencing past empirical studies on enzyme catalysis, autonomously executing tool calls, and analyzing as well as presenting the results. EnzySeek's capabilities cover multiple aspects, including protein structure prediction, molecular docking, system preparation and parameterization, molecular dynamics (MD) simulations, and QM/MM calculations. The conclusions drawn by EnzySeek are primarily based on the results of QM/MM calculations. We employed the semi-empirical quantum mechanical method GFN2-xTB to calculate the QM region of the system. Benchmark results indicate that the GFN2-xTB method can achieve high efficiency while maintaining accuracy. The EnzySeek agent is designed to continuously learn from newly published literature and past computational tasks. During its operation, every AI decision is manually verified and scored by human experts. This human-in-the-loop validation provides the AI with sufficient case-based support, ultimately contributing to the full automation of enzyme catalysis computations. All data generated during the simulations are compiled into a dataset, which is used to establish evaluation criteria specific to enzyme catalysis computational results.

5
BioPipelines: Accessible Computational Protein and Ligand Design for Chemical Biologists

Quargnali, G.; Rivera-Fuentes, P.

2026-03-13 bioinformatics 10.64898/2026.03.11.711024 medRxiv
Top 0.1%
10.2%

Deep learning methods for protein structure generation, sequence design, and structure and property prediction have created unprecedented opportunities for protein engineering and drug discovery. However, using these tools often requires navigating incompatible software environments, diverse input/output formats, and high-performance computing infrastructure, any of which may hinder adoption by primarily experimental chemical biology laboratories. Here we present BioPipelines, an open-source Python framework that allows researchers to define multi-step computational design workflows in a few lines of code. Additionally, its robust yet modular architecture provides a straightforward way to expand the toolkit with different functionalities, particularly by leveraging coding agents, with little effort. The framework currently integrates over 30 tools encompassing structure generation, sequence design, structure prediction, compound screening, and analysis. The same workflow code can be prototyped interactively in a Jupyter notebook and then submitted for production-scale runs without modification. We demonstrate applications in inverse folding, gene synthesis, de novo protein design, compound library screening, iterative binding site optimization, and fusion-protein linker optimization. We hope this framework will empower researchers, allowing them to focus on the scientific question rather than computational logistics. BioPipelines is available under the MIT license at https://github.com/locbp-uzh/biopipelines.

6
MOZAIC: Compound Growth via In Silico Reactions and Global Optimization using Conformational Space Annealing

Yoo, J.; Shin, W.-H.

2026-03-10 bioinformatics 10.64898/2026.03.07.710272 medRxiv
Top 0.1%
10.2%

Motivation: Fragment-based drug discovery (FBDD) is an efficient strategy that leverages small molecular fragments to explore broader chemical space by combining them. Advances in computational methods have enabled the calculation of molecular properties and docking scores, thereby accelerating the development of algorithm- and AI-based approaches in FBDD. However, it should be noted that certain methods do not provide synthetic pathways to obtain the proposed compounds. Consequently, these molecules might not be synthesized easily. Results: In light of these developments, we propose MOZAIC, a novel framework that explores chemical space using reaction-based fragment growing and Conformational Space Annealing, a powerful global optimization algorithm. Our results show that MOZAIC effectively produces chemically diverse molecules with balanced improvements in lead-like properties, including QED, synthetic accessibility, and binding affinity. Furthermore, its flexible objective function allows fine-tuning for specific design goals, such as enhancing solubility with binding affinity. These capabilities position MOZAIC as a valuable platform for advancing fragment-to-lead and lead optimization efforts in drug discovery. Availability and implementation: MOZAIC is available at https://github.com/kucm-lsbi/MOZAIC/. Supplementary Information: Supplementary data are available at Bioinformatics online.

7
Assessment of Generative De Novo Peptide Design Methods for G Protein-Coupled Receptors

Junker, H.; Schoeder, C. T.

2026-03-02 bioinformatics 10.64898/2026.02.26.708415 medRxiv
Top 0.1%
9.2%

G protein-coupled receptors (GPCRs) play a ubiquitous role in the transduction of extracellular stimuli into intracellular responses and therefore represent a major target for the development of novel peptide-based therapeutics. In fact, approximately 30% of all non-sensory GPCRs are peptide-targeted, representing a blueprint for the design of de novo peptides, both as pharmacological tools and therapeutics. The recent advances of deep learning-based protein structure generation and structure prediction offer a multitude of peptide design strategies for GPCRs, yet confidence metrics rarely correlate with experimental success. In the context of peptides, this problem is exacerbated due to the lack of elaborate tertiary structures in peptides, raising the question of whether this is due to inadequate sampling or insufficient scoring. In this two-part benchmark, we addressed this question by first simulating the validation process of 124 unique known GPCR-peptide complexes using AlphaFold2 Initial Guess, Boltz-2 and RosettaFold3. We then assessed the peptide sampling capabilities of the respective generative methods BindCraft, BoltzGen and RFdiffusion3. Our results indicate that current design pipelines primarily suffer from significant confidence overestimation for misplaced peptides in the validation phase across all three prediction methods. We further highlight occurrences of significant memorization in both prediction as well as generation of peptides. While all generative methods sample backbone space sufficiently, their simultaneous sequence generation remains subpar and can be partially recovered through the use of ProteinMPNN. Taken together, our benchmark offers guidance for the design of peptides specifically using deep learning-based pipelines.
Author summary: Deep learning-based protein design is revolutionizing computational biology and development of such tools is progressing rapidly with increasing attention from both academic and non-academic institutions. Their applicability and performance are often assessed from an all-purpose objective, with implicit bias towards larger protein-protein interactions. Due to their size, peptides therefore present an edge case where performance is known to decrease compared to larger, more structured proteins. Here, we present a benchmark specifically for the deep learning-based design of peptides targeting G protein-coupled receptors (GPCRs), a major therapeutic drug target family, assessing the generation of novel GPCR-targeting peptides and the validation of these designs separately. Our results show that generative methods sample potential peptide placements and orientations sufficiently but validation fails to differentiate valid from invalid designs, indicating that the so-called scoring problem remains unsolved. Although focusing on a specific use-case, our results are generalizable to the broader field of protein design. Consequently, it can offer guidance for peptide-specific design applications and can contribute to the development and improvement of new methods.

8
LigandForge: A Web Server for Structure-Guided De Novo Drug Design

Nada, H.; Sipos-Szabo, L.; Bajusz, D.; Keseru, G.; Gabr, M.

2026-04-03 bioinformatics 10.64898/2026.03.31.715741 medRxiv
Top 0.1%
8.7%

Despite advances in computational drug discovery, de novo drug design remains hindered by high licensing costs and the need for specialized programming expertise. We present LigandForge, a webserver for structure-guided de novo ligand generation. LigandForge integrates structural validation and binding-site characterization; voxel-based property grid construction for spatial mapping of electrostatics and hydrophobicity; chemistry-aware fragment assembly; multi-objective lead optimization; and retrosynthetic feasibility analysis. The platform utilizes a structure-guided framework to assemble molecules from curated fragment libraries while enforcing physicochemical constraints, including molecular weight, LogP, and hybridization states. Generated molecules are refined via reinforcement learning and genetic algorithms and are subsequently evaluated using composite metrics such as the quantitative estimate of drug-likeness. By leveraging RDKit for cheminformatics and NGL viewer for real-time 3D visualization, LigandForge provides a synthesis-aware environment that bridges the gap between macromolecular structural data and experimentally feasible lead compounds without requiring local software installation.

9
Artemis: Harnessing Knowledge Graphs for Next-Generation Drug Target Prioritization

Kiselev, V. Y.; Ainscow, E.

2026-01-29 bioinformatics 10.64898/2026.01.27.701959 medRxiv
Top 0.1%
6.4%

Knowledge graphs (KGs) have become an important asset in biomedical research and drug discovery by enabling the structured integration of heterogeneous biological knowledge. When combined with machine learning (ML), KGs support the identification of novel drug-target relationships, but existing approaches are often KG-centric, relying primarily on graph structure and embeddings while overlooking disease-specific biological and clinical context. Moreover, many high-impact applications depend on proprietary KG infrastructures, limiting accessibility for the broader research community. Here, we introduce Artemis, a practical and generalisable machine-learning framework for indication-aware target prioritisation that integrates public biomedical KGs with clinical evidence from the ChEMBL database. Artemis derives graph-based representations of clinically validated drug targets from multiple publicly available KGs and augments them with disease-relevant clinical features from ChEMBL. This hybrid feature space is used to train supervised ML models across seven disease indications, with performance assessed via cross-validation and guided parameter optimisation. The framework is further evaluated on emerging breast cancer targets reported at the San Antonio Breast Cancer Symposium 2024, demonstrating its ability to prioritise novel candidates. Overall, this work demonstrates that publicly available KGs can be used for actionable, translational target discovery when coupled with clinical data. Artemis provides an accessible, scalable, and cost-efficient alternative to proprietary KG platforms, thereby offering a practical solution for researchers seeking to prioritise therapeutic targets in real-world drug discovery settings.

Key Points:
- KG applications can support the identification of novel drug-target relationships but rely primarily on graph structure while overlooking disease-specific biological and clinical context.
- Artemis performs indication-aware target prioritisation that integrates public biomedical KGs with clinical evidence from the ChEMBL database.
- Artemis is evaluated on emerging breast cancer targets reported at the San Antonio Breast Cancer Symposium 2024, demonstrating its ability to prioritise novel candidates.
- Artemis provides an accessible, scalable, and cost-efficient alternative to proprietary KG platforms, offering a practical solution for researchers seeking to prioritise therapeutic targets in real-world drug discovery settings.

10
Seq2Pocket: Augmenting protein language models for spatially consistent binding site prediction

Skrhak, V.; Polak, L.; Novotny, M.; Hoksza, D.

2026-01-31 bioinformatics 10.64898/2026.01.28.702257 medRxiv
Top 0.1%
6.2%

Protein-ligand binding site prediction (LBS) is important for many domains including computational drug discovery, where, as in other tasks, protein language models (pLMs) have shown great promise. In their application to LBS, the pLM classifies each amino acid as binding or not. Subsequently, for the purposes of downstream analysis, these predictions are mapped onto the structure, forming structure-continuous pockets. However, their residue-oriented nature often results in spatially fragmented predictions. We present a comprehensive framework (Seq2Pocket) that addresses this by combining a finetuned pLM with an embedding-supported smoothing classifier and an optimized clustering strategy. While finetuning on our enhanced scPDB dataset yields state-of-the-art results, outperforming existing predictors by up to 11% in DCC recall, the smoothing classifier restores pocket continuity. Next, we introduce the Pocket Fragmentation Index (PFI) and use it to select a clustering approach that preserves a consistent mapping between predictions and ground-truth pockets. Validated on the LIGYSIS and CryptoBench benchmarks, our approach ensures that pLM-based predictions are not only statistically accurate but also useful for downstream drug discovery, while maintaining state-of-the-art performance.

11
CrossAffinity: A Sequence-Based Protein-Protein Binding Affinity Prediction Tool Using Cross-Attention Mechanism

Guan, J. S.; Wang, Z.; Mu, Y.

2026-02-23 bioinformatics 10.64898/2026.02.22.707318 medRxiv
Top 0.1%
6.2%

Protein-protein binding affinity is important for understanding protein interactions within a protein complex and for identifying strong drug-peptide binders to a target protein. Many structure-based models were built previously with reasonable performance. However, such models require the protein complex structure as input, which is usually unavailable due to high cost and experimental constraints. To tackle this issue, the sequence-based CrossAffinity model was constructed in this study, using a cross-attention module to extract contextual information of interacting protein components while separating the protein complex into two distinct parts to predict the protein-protein binding affinity. CrossAffinity outperformed all structure-based and sequence-based models on the S34 test set, which contains chronologically newer protein complex structures and binding affinity values, despite being trained on an older dataset, showing generalisability to new data points. On the other test sets, namely S90, the S90 subset and S79*, CrossAffinity also outperformed all other sequence-based models while maintaining comparable performance to many recently published structure-based models. The acceptable performance and quick inference of CrossAffinity enable it to be deployed in situations requiring the prediction of the binding affinity of many protein complexes that lack structural information.

12
MetaReact: A Reaction-Aware Transformer for End-to-End Prediction of Drug Metabolism

Wang, Y.; Rao, J.; Zhang, W.; Shi, Y.; Zeng, C.; Cui, R.; Wang, Y.; Xiong, J.; Li, X.; Zheng, M.

2026-03-18 biochemistry 10.64898/2026.03.14.711529 medRxiv
Top 0.1%
6.2%

Accurate prediction of drug metabolites and enzyme selectivity is essential for rational drug design and safety assessment. However, existing computational approaches are often limited to specific enzyme families or reaction types, lacking the capacity to model enzyme-subtype specificity and prioritize major metabolites. Here, we present MetaReact, an end-to-end generalizable Transformer-based model that unifies the prediction of metabolic enzymes, metabolites, and sites of metabolism (SOM). By integrating structure-aware encoding ReactSeq, a chemistry reaction-based pretraining, MetaReact consistently outperforms state-of-the-art methods across multiple benchmarks under three settings: enzyme-agnostic, enzyme-completion, and enzyme-conditioned. Notably, it achieves 60% Top-3 accuracy in identifying major metabolites and superior CYP450 enzyme-subtype prediction and SOM recognition. Case studies validate its applicability to complex natural products, synthetic cannabinoids, and clinical candidates, facilitating toxicity assessment and molecular optimization. This scalable, rule-free solution advances human metabolism modeling, with potential for computational pharmacokinetics and early drug discovery.

13
An Energy Landscape Approach to Miniaturizing Enzymes using Protein Language Model Embeddings

Lala, J.; Agrawal, H.; Dong, F.; Wells, J.; Angioletti-Uberti, S.

2026-03-05 bioinformatics 10.64898/2026.03.04.709378 medRxiv
Top 0.1%
4.8%

We present a general approach to find amino acid sequences corresponding to the most compact enzyme likely to retain the structure of a given catalytic site. Our approach is based on using Monte Carlo (MC) simulations to sample an energy landscape where minima correspond, by construction, to sequences with the aforementioned properties. Building on previous work (Wu et al., 2025) and with the BAGEL package (Lala et al., 2025), we implement a route to achieve this goal using only the information extracted from a protein language model (PLM), without structural information. After generating a set of candidate sequences with this PLM-guided BAGEL optimization, we further filter potential candidates for downstream experimental validation using a two-stage protocol. First, deep-learning-based structure prediction models (ESMFold, Chai-1, Boltz-2) are used to identify a structural consensus among designs with highly conserved active-site geometries, yielding many candidates with active-site RMSD below a few angstroms relative to the wild-type and pLDDT scores above 80. Second, molecular dynamics simulations are performed on a filtered subset of sequences (based on active-site RMSD and SolubleMPNN log-likelihoods) to evaluate active-site stability when including thermal fluctuations. For the most promising enzymes, these yield RMSF values in the active site below 1.0 Å and an active-site RMSD drift between 0.5 and 1.5 Å, making these mini-variants comparable to the wild type, though outcomes vary across enzymes. Given the protocol's generality, we believe these results represent a step forward in AI-guided enzyme design. To facilitate rapid experimental validation by the broader community, we open-source all sequences generated by our computational pipeline. These include designs for four representative enzymes of this study: PETase, subtilisin Carlsberg (serine protease), Taq DNA polymerase, and VioA.

14
Joint Modeling of Transcriptomic and Morphological Phenotypes for Generative Molecular Design

Verma, S.; Wang, M.; Jayasundara, S.; Malusare, A. M.; Wang, L.; Grama, A.; Kazemian, M.; Lanman, N. A.

2026-02-04 bioinformatics 10.64898/2026.02.02.703193 medRxiv
Top 0.1%
4.5%

Motivation: Phenotypic drug discovery generates rich multi-modal biological data from transcriptomic and morphological measurements, yet translating complex cellular responses into molecular design remains a computational bottleneck. Existing generative methods operate on single modalities and condition on post-treatment measurements without leveraging paired control-treatment dynamics to capture perturbation effects. Results: We present Pert2Mol, the first framework for multi-modal phenotype-to-structure generation that integrates transcriptomic and morphological features from paired control-treatment experiments. Pert2Mol employs bidirectional cross-attention between control and treatment states to capture perturbation dynamics, conditioning a rectified flow transformer that generates molecular structures along straight-line trajectories. We introduce Student-Teacher Self-Representation (SERE) learning to stabilize training in high-dimensional multi-modal spaces. On the GDP dataset, Pert2Mol achieves a Fréchet ChemNet Distance of 4.996 compared to 7.343 for diffusion baselines and 59.114 for transcriptomics-only methods, while maintaining perfect molecular validity and appropriate physicochemical property distributions. The model demonstrates 84.7% scaffold diversity and 12.4 times faster generation than diffusion approaches, with deterministic sampling suitable for hypothesis-driven validation. Availability: Code and pretrained models will be available at https://github.com/wangmengbo/Pert2Mol.

15
Influence of molecular representation and charge on protein-ligand structural predictions by popular co-folding methods

Bugrova, A.; Orekhov, P.; Gushchin, I.

2026-02-18 bioinformatics 10.64898/2026.02.18.706547 medRxiv
Top 0.1%
4.3%

Recently developed deep learning-based tools can effectively generate structural models of complexes of proteins and non-proteinaceous compounds. While some of their predictive capabilities are truly exciting, others remain to be thoroughly tested. Here, we probe whether the ligand input format (Chemical Component Dictionary, CCD, or Simplified Molecular Input Line Entry System, SMILES) and charge (which depends on protonation) affect the predictions of four popular algorithms: AlphaFold 3, Boltz-2, Chai-1, and Protenix-v1. We chose methylamine and acetic acid as two of the simplest titratable chemicals that are omnipresent in proteins as amino and carboxy moieties, and are consequently ubiquitous in the Protein Data Bank models that are most commonly used for training. Unexpectedly, we found that for both molecules, in many cases the input format affected the prediction results, and did so much more strongly than protonation, whereas changes in the formally specified charge of the molecules did not lead to the changes in binding expected from experiments. We conclude that (i) ensuring identical results irrespective of input formats and (ii) inclusion of protonation-related steps into training and prediction pipelines are the two available paths for improvement of protein-ligand structure prediction algorithms.

16
Drug-Target Interaction Prediction with PIGLET

Carpenter, K. A.; Altman, R. B.

2026-02-18 bioinformatics 10.64898/2026.02.18.706530 medRxiv
Top 0.1%
4.3%

Drug-target interaction (DTI) prediction is a key task for computer-aided drug development that has been widely approached by deep learning models. Despite extremely high reported performance, these models have yet to find widespread success in accelerating real-world drug discovery. In contrast with the most common approach of creating embeddings from one-dimensional or three-dimensional representations of the input drug and input target, we create a novel graph transformer method for DTI prediction that operates on a proteome-wide knowledge graph of binding pocket similarity, protein-protein interactions, drug similarity, and known binding relationships. We benchmark our method, named PIGLET, against existing DTI prediction models on the Human dataset. We assess performance with two different splitting strategies: the frequently-reported random split, and a novel, more rigorous drug-based split. All models perform similarly well on the random split, and PIGLET outperforms all models on the drug-based split. We highlight the utility of PIGLET through a real-world drug discovery case study.
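
The difference between the two splitting strategies can be illustrated with a minimal sketch. The abstract does not define the drug-based split, so the code below assumes the common reading: all pairs involving a given drug go to the same side, so no test drug is seen during training. The function names, the `(drug, target, label)` tuple layout, and the toy data are hypothetical and are not taken from PIGLET.

```python
import random
from collections import defaultdict


def random_split(pairs, test_fraction=0.2, seed=0):
    """Shuffle individual (drug, target, label) pairs; drugs may leak across sides."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def drug_based_split(pairs, test_fraction=0.2, seed=0):
    """Assumed definition: hold out whole drugs, so test drugs are unseen in training."""
    by_drug = defaultdict(list)
    for pair in pairs:
        by_drug[pair[0]].append(pair)  # pair[0] is the drug identifier
    drugs = sorted(by_drug)
    rng = random.Random(seed)
    rng.shuffle(drugs)
    n_test_drugs = max(1, int(round(len(drugs) * test_fraction)))
    test_drugs = set(drugs[:n_test_drugs])
    train = [p for d in drugs if d not in test_drugs for p in by_drug[d]]
    test = [p for d in drugs if d in test_drugs for p in by_drug[d]]
    return train, test


def shared_drugs(train, test):
    """Drugs that appear on both sides of a split (a source of optimistic scores)."""
    return {p[0] for p in train} & {p[0] for p in test}


# Toy data: 10 hypothetical drugs, 7 hypothetical targets, 70 labelled pairs.
pairs = [(f"drug{i % 10}", f"target{i % 7}", i % 2) for i in range(70)]

rand_train, rand_test = random_split(pairs)
drug_train, drug_test = drug_based_split(pairs)

print("drugs shared under random split:", len(shared_drugs(rand_train, rand_test)))
print("drugs shared under drug-based split:", len(shared_drugs(drug_train, drug_test)))
```

Under this assumed definition, the drug-based split forces a model to generalise to unseen compounds, which is one plausible reason why models that look similar on a random split can separate on the stricter one.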

17
Macro-Equi-Diff (MED): Scaffold-based Macrocycles Generation Using Equivariant Diffusion

Kambampati, S. S.; Anumandla, S.; Guttula, S. L.; Kavadi, V. R.; Gogte, S.; Kondaparthi, V.

2026-02-06 bioinformatics 10.64898/2026.02.05.703948 medRxiv
Top 0.1%
4.3%

Macrocyclic compounds are essential in drug discovery as they can modulate protein-protein interactions and enhance selectivity. Their structural complexity enables access to molecular diversity beyond traditional small molecules; however, designing feasible macrocycles remains a challenging task. Current computational methods often fail to generate macrocycles with proper drug-like properties. Here, we present Macro-Equi-Diff (MED), a deep learning framework that combines transformer-based site identification with an E(3)-equivariant Diffusion Model (EDM) for linker creation, and a fragment-linker attachment module. MED transforms acyclic molecules into structurally consistent macrocycles. MED was tested on the ZINC dataset, achieving high validity (93.92%), uniqueness (99.94%), macrocyclization (99.92%), and linker novelty (82.81%). MED improves upon previous methods that lack a macrocyclic geometry context. As a case study, MED was used to macrocyclize four acyclic drugs targeting the JAK2 protein. The generated macrocycles exhibited favourable molecular descriptors and strong binding affinities, establishing MED as a reliable method for expanding the macrocyclic chemical space.

18
A Multi-Modal AI/ML-based Framework for Protein Conformation Selection and Prediction in Drug Discovery Applications

Gupta, S.; Menon, V.; Baudry, J.

2026-02-18 molecular biology 10.64898/2026.02.17.706293 medRxiv
Top 0.1%
4.1%

The development of pharmaceutical drugs is a time-intensive and costly process, with more than 90% of drug candidates failing during preclinical or clinical testing. A major challenge lies in accurately predicting protein-ligand interactions, especially given that traditional computational methods often rely on a single protein conformation, failing to capture biologically relevant structural variability. To address this, we present an AI/ML-based multi-modal framework based on a Graph Convolutional Network (GCN) that integrates both global and local protein descriptors to classify binding and non-binding conformations more effectively. Global descriptors capture overarching physico-chemical and structural properties of proteins, while local descriptors, such as pharmacophores, provide site-specific information crucial for modeling ligand interactions. Our GCN-based approach demonstrates that integrating local and global structural perspectives significantly improves predictive accuracy and robustness. By enabling more reliable protein conformation classification, this work contributes toward scalable, AI-driven drug discovery, an increasingly critical goal in response to global health challenges.

19
ABFormer: A Transformer-based Model to Enhance Antibody-Drug Conjugates Activity Prediction through Contextualized Antibody-Antigen Embedding

Katabathuni, R.; Loka, V.; Gogte, S.; Kondaparthi, V.

2026-02-05 bioinformatics 10.64898/2026.02.03.703522 medRxiv
Top 0.1%
4.0%

Computational screening is becoming a crucial aspect of Antibody-Drug Conjugate (ADC) research, allowing dead ends to be eliminated at earlier stages and effort to be concentrated on potential candidates, which can significantly reduce the cost of development. The current state-of-the-art deep learning model, ADCNet, usually considers antibodies, antigens, linkers, and payloads as distinct features. However, this overlooks the complex context of antibody-antigen binding, which is primarily responsible for the targeting and uptake of ADCs. To address this limitation, we present ABFormer, a transformer-based framework tailored for ADC activity prediction and in-silico triage. ABFormer integrates high-resolution antibody-antigen interface information through a pretrained interaction encoder and combines it with chemically enriched linker and payload representations obtained from a fine-tuned molecular encoder. This multi-modal design replaces naive feature concatenation with biologically informed contextual embeddings that more accurately reflect molecular recognition. ABFormer outperforms the baselines in leave-pair-out evaluation and achieves 100% accuracy on a separate test set of 22 novel ADCs, while the baselines are severely miscalibrated. An ablation study confirms that the predictive capability is predominantly driven by interaction-aware antibody-antigen representations, while small-molecule encoders enhance specificity by reducing false positives. In conclusion, ABFormer provides a reliable and efficient platform for early filtering of ADC activity and selection of candidates.

20
ToxiVerse: A Public Platform for Chemical Toxicity Data Sharing and Customizable Predictive Modeling

Durai, P.; Russo, D. P.; Shen, Y.; Wang, T.; Chung, E.; Li, L.; Zhu, H.

2026-03-02 bioinformatics 10.64898/2026.02.26.708255 medRxiv
Top 0.1%
4.0%

Chemical toxicity assessment is critical for drug development and environmental safety. Computational models have emerged as a promising alternative to animal testing and now play a significant role in efficiently evaluating new chemicals. To address the urgent need for providing user-friendly machine learning tools in computational toxicology, we developed ToxiVerse, a public web-based platform. It provides curated toxicity datasets, automatic chemical bioprofiling, and a predictive modeling interface designed for researchers who lack programming expertise. The platform comprises three integrated modules: (i) the Bioprofiler module, which provides chemical descriptors by combining chemical-bioactivity data from PubChem assay with a machine learning-based data gap-filling procedure; (ii) the Database module, which hosts around 50,000 curated unique chemicals covering diverse toxicity endpoints; and (iii) the Cheminformatics module, which allows users to upload their own datasets, use datasets from ToxiVerse, or retrieve existing data from PubChem; perform chemical curation; and automatically generate Quantitative Structure-Activity Relationship (QSAR) models to predict chemicals of interest. ToxiVerse enables researchers to carry out bioprofiling, access curated toxicity datasets, and evaluate chemical toxicity through machine learning-based modeling and prediction. The platform is supported by sample files and a detailed tutorial, and it is freely accessible at www.toxiverse.com.